14 - MLPDES25: Remarks on matching measures with ML architectures [ID:57534]

Thank you.

Well, this will be the third talk on transformers today.

So you will be very used to a large part of the talk I'm going to give.

This is a joint work with Borjan Geshkovski, who is here, and Philippe Rigollet.

Okay, so I'll be speaking about transformers, what transformers are, even if you have already heard it a bit today.

And then I'll explain the problem we deal with, the results, and some hints of the proofs.

So what are transformers? As you have probably experienced, you can open ChatGPT and write anything inside.

So here you can write, for instance, a sentence, and ChatGPT gives you a reply.

So basically one thing that's important is that the input can be a sentence of essentially any length you want, and you still get a reply.

So basically, as Borjan was also introducing, this sentence is actually embedded as a sequence in (R^d)^n.

So you have n vectors, which are called tokens, and each x_i belongs to R^d, where this dimension d is of the order of hundreds or even thousands.

Okay, and now, this number n, the length of the sequence, can be very large, like a paragraph or even a whole book.

So you can think that this n is actually very, very large, much larger than d.

And here I would like to go to what is actually written inside a transformer, which is basically what you can find in the GPT-2 code.

So this is the derivation of the model that you have already seen, but reduced to just these two lines.

This is really what is inside the GPT-2 code: you enter an X, you compute something called attention, you add it to X itself, and then you apply something called an MLP to it.

And then you add it again, and this is looped over and over.

This can be translated into equations, up to neglecting one normalization. Remember that you have a sequence x_1, ..., x_n. Attention means that you compute a convex combination of the tokens, where everything I put in red denotes parameters that should be found by some optimization algorithm.

And the MLP is basically something that applies to every element of the sequence independently, where σ is an activation function.

So basically what you have is a discrete controlled dynamical system of this sort, where the controls are everything you see in red.
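As a minimal sketch of this discrete controlled system (my own illustration, not the speaker's code; I assume single-head attention with parameter matrices Q, K, V for the attention and W, A, b for the MLP, and the 1/sqrt(d) scaling is an assumption):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def transformer_layer(X, Q, K, V, W, A, b, sigma=np.tanh):
    """One residual block: X <- X + attention(X), then X <- X + MLP(X).
    X has shape (n, d); Q, K, V, W, A, b are the parameters 'in red'."""
    n, d = X.shape
    # Self-attention: each token receives a convex combination of (value-transformed) tokens.
    scores = softmax((X @ Q.T) @ (K @ X.T) / np.sqrt(d), axis=-1)   # (n, n), row-stochastic
    X = X + scores @ (X @ V.T)
    # MLP applied to every token independently.
    X = X + sigma(X @ A.T + b) @ W.T
    return X
```

Looping transformer_layer with (possibly layer-dependent) parameters corresponds to the "over and over" iteration of the block described above.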

So the first thing is that you can also see that this comes from a differential equation.

You can see it by rescaling the parameters V and W with Δt and passing Δt to zero, or you can even see directly that if you do a Lie-Trotter splitting of this ODE, you actually get what is inside GPT-2.

And basically, as was also said before, since we have this normalization, this normalization makes our dynamics live on the sphere.
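As a sketch, the continuous-time dynamics on the sphere obtained in this limit can be written as follows (my notation; the precise softmax normalization and projection are assumptions, matching the usual form of these models rather than the exact slides):

```latex
\[
  \dot{x}_i(t) \;=\; \mathbf{P}_{x_i(t)}^{\perp}
  \Bigg[
    \underbrace{\sum_{j=1}^{n}
      \frac{e^{\langle Q x_i, K x_j\rangle}}{\sum_{k=1}^{n} e^{\langle Q x_i, K x_k\rangle}}\, V x_j}_{\text{attention}}
    \;+\;
    \underbrace{W\,\sigma(A x_i + b)}_{\text{MLP}}
  \Bigg],
  \qquad x_i(t)\in\mathbb{S}^{d-1},
\]
```

where P_x^⊥ = Id − x xᵀ projects onto the tangent space of the sphere, so the normalization indeed keeps the dynamics on the sphere.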

So there have been several works on this. First, in this paper here, they were actually making this analogy with the Lie-Trotter splitting.

There is also this very first paper from Gabriel as well, speaking about transformers as basically mean-field, or particle, systems.

Also the paper by Borjan and collaborators. And you can also see that this type of dynamics essentially falls into the realm of collective behavior models, such as the Kuramoto model or Cucker-Smale and so on.

So basically, just to recall a bit of the terminology: in the previous talks we have been talking a lot about the self-attention dynamics, which is basically what you get when you shut down the drift part.

So you set this parameter W to zero and you get a dynamics like this. And if we were to shut down the attention dynamics instead, we would get something that I will be calling a neural ODE on the sphere.
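For reference, the two reduced dynamics just mentioned can be written, in the same notation as above (again a sketch, not the exact formulas from the slides), as:

```latex
% Self-attention dynamics (drift/MLP part switched off, W = 0):
\[
  \dot{x}_i(t) \;=\; \mathbf{P}_{x_i(t)}^{\perp}
  \sum_{j=1}^{n} \frac{e^{\langle Q x_i, K x_j\rangle}}{\sum_{k=1}^{n} e^{\langle Q x_i, K x_k\rangle}}\, V x_j .
\]
% Neural ODE on the sphere (attention switched off):
\[
  \dot{x}_i(t) \;=\; \mathbf{P}_{x_i(t)}^{\perp}\, W\,\sigma\big(A x_i(t) + b\big).
\]
```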

So basically now I'm going to motivate the problem. The first thing is that the sequences we get can have very different lengths.

Remember that we can type anything we want into ChatGPT, of any size we want. And also this n can be very large.

Okay. There is also another property, which was also mentioned in the previous talk: when a sentence is encoded, it is done in such a way that if you permute the elements of the sequence, you actually don't have any problem.

So basically one of the right possible settings is actually to look at the sequence as a measure, because if we permute anything here, nothing happens.

And it also allows us to consider variable values of n. And since n is very large, we may even think of looking at this as an absolutely continuous measure, or any arbitrary measure.

So what is the output? This will remind you of the talk before. Imagine that you have a sentence like "the ___ barked at me", and you would like to know what this something that barked at you is.

So obviously you have very few possibilities for which kinds of animals can bark at you; you will not have the word "table". So basically you have an input, which is a sentence, and you would like to predict, for instance, either the next word or a missing word inside.

And your output will also be a probability measure, but with few atoms, because there are not many possibilities. So basically the setting we have to think of is that the output is typically going to have very few atoms.

So basically what we would like to have is some sort of device that maps the different types of inputs, with different possible lengths, to outputs.

Okay. So basically we want to know the probability of the next word given any sentence we are going to get. And we set it up for N inputs and N outputs (capital N), so it's like an assignment problem.

So basically, since again we can actually look at this sequence as an empirical measure, the dynamics we were writing before also makes sense for the measure.

So you can actually write this equation, which is basically the associated continuity equation, where the only thing is that we replace the sums by integrals.

So again, if you take this μ to be an empirical measure, we recover exactly the thing we had before.
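A sketch of this continuity equation, in the notation used above (the exact form of the vector field is my assumption, consistent with the particle dynamics written earlier):

```latex
\[
  \partial_t \mu_t \;+\; \operatorname{div}\!\big( \mathcal{X}[\mu_t]\,\mu_t \big) \;=\; 0,
  \qquad
  \mathcal{X}[\mu](x) \;=\; \mathbf{P}_{x}^{\perp}\!
  \left[
    \frac{\displaystyle\int e^{\langle Q x, K y\rangle}\, V y \,\mathrm{d}\mu(y)}
         {\displaystyle\int e^{\langle Q x, K y\rangle}\,\mathrm{d}\mu(y)}
    \;+\; W\,\sigma(A x + b)
  \right],
\]
```

and taking μ_t = (1/n) Σ_i δ_{x_i(t)} recovers the particle system above.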

Okay. So now, as I was saying before, we also have the so-called neural ODEs, which are basically differential equations of this sort, where again you can choose these parameters W, A, b.

And basically what you are typically interested in there is the solution map that you get. So you just put in an input x_0, you solve this ODE, and you get x(T).

And the flow map is basically the function you are interested in. So basically neural ODEs are parameterized flow maps in R^d, or on the sphere if you were to project this onto the sphere.

Whereas transformers are parameterized flow maps in the space of probability measures, via solving this continuity equation.
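As a toy illustration of such a flow map (my own sketch, not code from the talk), one can approximate the neural ODE on the sphere with an explicit Euler step followed by renormalization:

```python
import numpy as np

def neural_ode_flow(x0, W, A, b, T=1.0, steps=100, sigma=np.tanh):
    """Approximate the flow map x0 -> x(T) of dx/dt = P_x^perp[ W sigma(A x + b) ]
    on the unit sphere, via explicit Euler and projection back onto the sphere."""
    x = x0 / np.linalg.norm(x0)
    dt = T / steps
    for _ in range(steps):
        v = W @ sigma(A @ x + b)
        v = v - np.dot(v, x) * x          # project onto the tangent space at x
        x = x + dt * v
        x = x / np.linalg.norm(x)         # renormalize to stay on the sphere
    return x
```

The map x0 ↦ neural_ode_flow(x0, ...) is the parameterized flow map on the sphere; the transformer analogue acts in the same way, but on probability measures through the continuity equation.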

Okay. So this vision was also always behind these papers here.

And now that we understand that transformers are flow maps in the space of probability measures, what we would actually like to understand is the following: you are given

N possible inputs and N possible outputs. So basically you imagine that you have several texts, several books, and you have a title for each book.

And now you want to find parameters for the continuity equation such that, when you solve this continuity equation, the solution is approximately the output we desire.

Okay. So this is basically an interpolation problem. You are given a map that goes from the space of probability measures to the space of probability measures.
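In symbols (my notation), the interpolation, or matching, problem could be stated as follows, where Φ_θ denotes the flow map of the controlled continuity equation with parameters θ and W_2 is, for instance, the Wasserstein distance; the actual notion of closeness used in the results may differ:

```latex
\[
  \text{Given } (\mu^{(k)}, \nu^{(k)})_{k=1}^{N}
  \subset \mathcal{P}(\mathbb{S}^{d-1})\times\mathcal{P}(\mathbb{S}^{d-1}),
  \quad
  \text{find controls } \theta \text{ such that }
  \max_{k}\; W_2\!\big(\Phi_\theta(\mu^{(k)}),\, \nu^{(k)}\big) \;\le\; \varepsilon .
\]
```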

Presenters

Dr. Domènec Ruiz-Balet

Accessible via

Open Access

Duration

00:26:24 min

Recording date

2025-04-29

Uploaded on

2025-04-30 15:39:36

Language

en-US

#MLPDES25 Machine Learning and PDEs Workshop 
Mon. – Wed. April 28 – 30, 2025
HOST: FAU MoD, Research Center for Mathematics of Data at FAU, Friedrich-Alexander-Universität Erlangen-Nürnberg Erlangen – Bavaria (Germany)
 
SPEAKERS 
• Paola Antonietti. Politecnico di Milano
 • Alessandro Coclite. Politecnico di Bari
 • Fariba Fahroo. Air Force Office of Scientific Research
 • Giovanni Fantuzzi. FAU MoD/DCN-AvH, Friedrich-Alexander-Universität Erlangen-Nürnberg
 • Borjan Geshkovski. Inria, Sorbonne Université
 • Paola Goatin. Inria, Sophia-Antipolis
 • Shi Jin. SJTU, Shanghai Jiao Tong University 
 • Alexander Keimer. Universität Rostock
 • Felix J. Knutson. Air Force Office of Scientific Research
 • Anne Koelewijn. FAU MoD, Friedrich-Alexander-Universität Erlangen-Nürnberg
 • Günter Leugering. FAU, Friedrich-Alexander-Universität Erlangen-Nürnberg
 • Lorenzo Liverani. FAU, Friedrich-Alexander-Universität Erlangen-Nürnberg
 • Camilla Nobili. University of Surrey
 • Gianluca Orlando. Politecnico di Bari
 • Michele Palladino. Università degli Studi dell’Aquila
 • Gabriel Peyré. CNRS, ENS-PSL
 • Alessio Porretta. Università di Roma Tor Vergata
 • Francesco Regazzoni. Politecnico di Milano
 • Domènec Ruiz-Balet. Université Paris Dauphine
 • Daniel Tenbrinck. FAU, Friedrich-Alexander-Universität Erlangen-Nürnberg
 • Daniela Tonon. Università di Padova
 • Juncheng Wei. Chinese University of Hong Kong
 • Yaoyu Zhang. Shanghai Jiao Tong University
 • Wei Zhu. Georgia Institute of Technology
 
SCIENTIFIC COMMITTEE 
• Giuseppe Maria Coclite. Politecnico di Bari
• Enrique Zuazua. FAU MoD/DCN-AvH, Friedrich-Alexander-Universität Erlangen-Nürnberg
 
ORGANIZING COMMITTEE 
• Darlis Bracho Tudares. FAU MoD/DCN-AvH, Friedrich-Alexander-Universität Erlangen-Nürnberg
• Nicola De Nitti. Università di Pisa
• Lorenzo Liverani. FAU DCN-AvH, Friedrich-Alexander-Universität Erlangen-Nürnberg
 
Video teaser of the #MLPDES25 Workshop: https://youtu.be/4sJPBkXYw3M
 
 
#FAU #FAUMoD #MLPDES25 #workshop #erlangen #bavaria #germany #deutschland #mathematics #research #machinelearning #neuralnetworks

Tags

Erlangen, mathematics, Neural Network, PDE, Applied Mathematics, FAU MoD, Partial Differential Equations, Bavaria, Machine Learning, FAU MoD workshop, FAU